At the end of this session, you will have learned how to:
- Use the syntax from the dplyr module to manipulate RxXdfData data objects
- Summarize and transform RxXdfData objects quickly and easily
- Understand the limitations of the dplyrXdf package and when to use functions from the RevoScaleR package
Microsoft R Family
The RevoScaleR package enables R users to manipulate data that is larger than memory. It introduces a new data type, called an xdf (short for eXternal Data Frame), which is a highly efficient out-of-memory data object. However, the RevoScaleR functions have a dramatically different syntax from base R functions. The dplyr package, on the other hand, is exceptionally popular, due to its appealing syntax and its extensibility. For this reason, there is a package called dplyrXdf that exposes most of the dplyr functionality to xdf objects.

Working with out-of-memory data normally requires managing a chain of intermediate files; dplyrXdf abstracts this task of file management, so that you can focus on the data itself rather than the management of intermediate files. Unlike dplyr, or other base R packages, dplyrXdf allows you to work with data residing outside of memory, and therefore scales to datasets of arbitrary size. To install dplyrXdf, you need the devtools package (github.com/hadley/devtools) (and, if on a Windows machine, Rtools).

your_name <- "alizaidi"
your_dir <- paste0('/datadrive/', your_name)
# File Path to your Data
your_data <- file.path(your_dir, 'tripdata_2015.xdf')
dir.create(your_dir)
## Warning in dir.create(your_dir): '/datadrive/alizaidi' already exists
download.file("http://alizaidi.blob.core.windows.net/training/trainingData/manhattan.xdf",
destfile = your_data)
library(dplyrXdf)
taxi_xdf <- RxXdfData(your_data)
taxi_xdf %>% head
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2015-07-01 00:00:00 2015-07-01 00:15:26 1
## 2 2 2015-07-01 00:00:06 2015-07-01 00:04:44 1
## 3 2 2015-07-01 00:00:09 2015-07-01 00:06:27 5
## 4 2 2015-07-01 00:00:12 2015-07-01 00:04:18 2
## 5 2 2015-07-01 00:00:16 2015-07-01 00:27:13 1
## 6 1 2015-07-01 00:00:18 2015-07-01 00:15:11 1
## trip_distance pickup_longitude pickup_latitude RatecodeID
## 1 3.50 -73.99416 40.75113 1
## 2 0.77 -73.98556 40.75554 1
## 3 1.12 -73.97540 40.75190 1
## 4 1.02 -74.01014 40.72051 1
## 5 10.43 -73.87294 40.77415 1
## 6 2.60 -73.98057 40.75100 1
## store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type
## 1 N -73.97682 40.78857 2
## 2 N -73.97373 40.75031 2
## 3 N -73.99106 40.75073 1
## 4 N -74.00793 40.73129 1
## 5 N -73.98093 40.76450 1
## 6 N -74.00441 40.73099 1
## fare_amount tip_amount tolls_amount tip_percent pickup_hour pickup_dow
## 1 14.0 0.00 0.00 0.0000000 10PM-1AM Wed
## 2 5.0 0.00 0.00 0.0000000 10PM-1AM Wed
## 3 6.5 2.34 0.00 0.3600000 10PM-1AM Wed
## 4 5.5 1.36 0.00 0.2472727 10PM-1AM Wed
## 5 32.5 11.80 5.54 0.3630769 10PM-1AM Wed
## 6 11.5 2.55 0.00 0.2217391 10PM-1AM Wed
## dropoff_hour dropoff_dow trip_duration pickup_nhood
## 1 10PM-1AM Wed 926 Garment District
## 2 10PM-1AM Wed 278 Garment District
## 3 10PM-1AM Wed 378 Turtle Bay
## 4 10PM-1AM Wed 246 Tribeca
## 5 10PM-1AM Wed 1617 La Guardia Airport
## 6 10PM-1AM Wed 893 Murray Hill
## dropoff_nhood Sample
## 1 Upper West Side Keep
## 2 Tudor City Keep
## 3 Garment District Keep
## 4 West Village Keep
## 5 Midtown Keep
## 6 Greenwich Village Keep
taxi_xdf %>% nrow
## [1] 13550925
class(taxi_xdf)
## [1] "RxXdfData"
## attr(,"package")
## [1] "RevoScaleR"
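The xdf file downloaded above was prepared in advance. As a hedged sketch (the file paths here are placeholders, not files from this session), you could create an xdf yourself from a delimited text file using RevoScaleR's rxImport:

```r
library(RevoScaleR)

# Hypothetical paths -- substitute your own CSV and destination.
csv_in  <- RxTextData("tripdata_2015.csv")
xdf_out <- RxXdfData("tripdata_2015_import.xdf")

# rxImport reads the text file in chunks and writes a compressed,
# chunked xdf file, so the data never has to fit in memory at once.
rxImport(inData = csv_in, outFile = xdf_out, overwrite = TRUE)

rxGetInfo(xdf_out, getVarInfo = TRUE)
```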
Any data transformation done with the dplyrXdf package can also be completed by using the rxDataStep function in the RevoScaleR package included with your MRS installation. In fact, dplyrXdf consists almost entirely of wrapper functions that call on other RevoScaleR functions.

rxDataStep vs dplyrXdf

taxi_xdf %>% rxGetInfo(getVarInfo = TRUE, numRows = 4)
## File name: /datadrive/alizaidi/tripdata_2015.xdf
## Number of observations: 13550925
## Number of variables: 24
## Number of blocks: 326
## Compression type: zlib
## Variable information:
## Var 1: VendorID, Type: integer, Low/High: (1, 2)
## Var 2: tpep_pickup_datetime, Type: character
## Var 3: tpep_dropoff_datetime, Type: character
## Var 4: passenger_count, Type: integer, Low/High: (1, 9)
## Var 5: trip_distance, Type: numeric, Storage: float32, Low/High: (0.0000, 29.9900)
## Var 6: pickup_longitude, Type: numeric, Storage: float32, Low/High: (-74.2413, -73.7004)
## Var 7: pickup_latitude, Type: numeric, Storage: float32, Low/High: (40.5256, 40.9128)
## Var 8: RatecodeID, Type: integer, Low/High: (1, 99)
## Var 9: store_and_fwd_flag, Type: character
## Var 10: dropoff_longitude, Type: numeric, Storage: float32, Low/High: (-74.2485, -73.7007)
## Var 11: dropoff_latitude, Type: numeric, Storage: float32, Low/High: (40.5027, 40.9138)
## Var 12: payment_type, Type: integer, Low/High: (1, 5)
## Var 13: fare_amount, Type: numeric, Storage: float32, Low/High: (0.0100, 2020.3700)
## Var 14: tip_amount, Type: numeric, Storage: float32, Low/High: (0.0000, 854.8500)
## Var 15: tolls_amount, Type: numeric, Storage: float32, Low/High: (0.0000, 912.5000)
## Var 16: tip_percent, Type: numeric, Low/High: (0.0000, 11600.0003)
## Var 17: pickup_hour
## 7 factor levels: 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
## Var 18: pickup_dow
## 7 factor levels: Sun Mon Tue Wed Thu Fri Sat
## Var 19: dropoff_hour
## 7 factor levels: 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
## Var 20: dropoff_dow
## 7 factor levels: Sun Mon Tue Wed Thu Fri Sat
## Var 21: trip_duration, Type: integer, Low/High: (1, 86399)
## Var 22: pickup_nhood
## 269 factor levels: Annadale Arden Heights Arrochar Arverne Astoria ... Wingate Woodhaven Woodlawn Woodrow Woodside
## Var 23: dropoff_nhood
## 269 factor levels: Annadale Arden Heights Arrochar Arverne Astoria ... Wingate Woodhaven Woodlawn Woodrow Woodside
## Var 24: Sample
## 2 factor levels: Keep Drop
## Data (4 rows starting with row 1):
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2015-07-01 00:00:00 2015-07-01 00:15:26 1
## 2 2 2015-07-01 00:00:06 2015-07-01 00:04:44 1
## 3 2 2015-07-01 00:00:09 2015-07-01 00:06:27 5
## 4 2 2015-07-01 00:00:12 2015-07-01 00:04:18 2
## trip_distance pickup_longitude pickup_latitude RatecodeID
## 1 3.50 -73.99416 40.75113 1
## 2 0.77 -73.98556 40.75554 1
## 3 1.12 -73.97540 40.75190 1
## 4 1.02 -74.01014 40.72051 1
## store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type
## 1 N -73.97682 40.78857 2
## 2 N -73.97373 40.75031 2
## 3 N -73.99106 40.75073 1
## 4 N -74.00793 40.73129 1
## fare_amount tip_amount tolls_amount tip_percent pickup_hour pickup_dow
## 1 14.0 0.00 0 0.0000000 10PM-1AM Wed
## 2 5.0 0.00 0 0.0000000 10PM-1AM Wed
## 3 6.5 2.34 0 0.3600000 10PM-1AM Wed
## 4 5.5 1.36 0 0.2472727 10PM-1AM Wed
## dropoff_hour dropoff_dow trip_duration pickup_nhood dropoff_nhood
## 1 10PM-1AM Wed 926 Garment District Upper West Side
## 2 10PM-1AM Wed 278 Garment District Tudor City
## 3 10PM-1AM Wed 378 Turtle Bay Garment District
## 4 10PM-1AM Wed 246 Tribeca West Village
## Sample
## 1 Keep
## 2 Keep
## 3 Keep
## 4 Keep
taxi_transform <- RxXdfData(your_data)
system.time(rxDataStep(inData = taxi_xdf,
outFile = taxi_transform,
transforms = list(tip_pct = tip_amount/fare_amount),
overwrite = TRUE))
## user system elapsed
## 18.184 1.340 73.184
rxGetInfo(RxXdfData(taxi_transform), numRows = 2)
## File name: /datadrive/alizaidi/tripdata_2015.xdf
## Number of observations: 13550925
## Number of variables: 25
## Number of blocks: 326
## Compression type: zlib
## Data (2 rows starting with row 1):
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2015-07-01 00:00:00 2015-07-01 00:15:26 1
## 2 2 2015-07-01 00:00:06 2015-07-01 00:04:44 1
## trip_distance pickup_longitude pickup_latitude RatecodeID
## 1 3.50 -73.99416 40.75113 1
## 2 0.77 -73.98556 40.75554 1
## store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type
## 1 N -73.97682 40.78857 2
## 2 N -73.97373 40.75031 2
## fare_amount tip_amount tolls_amount tip_percent pickup_hour pickup_dow
## 1 14 0 0 0 10PM-1AM Wed
## 2 5 0 0 0 10PM-1AM Wed
## dropoff_hour dropoff_dow trip_duration pickup_nhood dropoff_nhood
## 1 10PM-1AM Wed 926 Garment District Upper West Side
## 2 10PM-1AM Wed 278 Garment District Tudor City
## Sample tip_pct
## 1 Keep 0
## 2 Keep 0
We can perform the same transformation with dplyrXdf, using the exact same syntax that we learned in the dplyr module and taking advantage of the %>% operator:

system.time(taxi_transform <- taxi_xdf %>% mutate(tip_pct = tip_amount/fare_amount))
## user system elapsed
## 16.672 1.284 72.019
taxi_transform %>% rxGetInfo(numRows = 2)
## File name: /tmp/RtmptR5PAR/file4e7e31120a5d.xdf
## Number of observations: 13550925
## Number of variables: 25
## Number of blocks: 326
## Compression type: zlib
## Data (2 rows starting with row 1):
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2015-07-01 00:00:00 2015-07-01 00:15:26 1
## 2 2 2015-07-01 00:00:06 2015-07-01 00:04:44 1
## trip_distance pickup_longitude pickup_latitude RatecodeID
## 1 3.50 -73.99416 40.75113 1
## 2 0.77 -73.98556 40.75554 1
## store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type
## 1 N -73.97682 40.78857 2
## 2 N -73.97373 40.75031 2
## fare_amount tip_amount tolls_amount tip_percent pickup_hour pickup_dow
## 1 14 0 0 0 10PM-1AM Wed
## 2 5 0 0 0 10PM-1AM Wed
## dropoff_hour dropoff_dow trip_duration pickup_nhood dropoff_nhood
## 1 10PM-1AM Wed 926 Garment District Upper West Side
## 2 10PM-1AM Wed 278 Garment District Tudor City
## Sample tip_pct
## 1 Keep 0
## 2 Keep 0
The major difference between the rxDataStep operation and the dplyrXdf method is that we do not specify an outFile argument anywhere in the dplyrXdf pipeline; we simply assign the result of the mutate to a new variable called taxi_transform. dplyrXdf writes its output to a temporary xdf file, and only saves the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file. To save the result of a pipeline to a non-temporary location, use the persist verb.

taxi_transform@file
## [1] "/tmp/RtmptR5PAR/file4e7e31120a5d.xdf"
persist(taxi_transform, outFile = "/datadrive/alizaidi/taxiTransform.xdf") -> taxi_transform
The dplyrXdf package really shines when used for data aggregations and summarizations. While the RevoScaleR functions rxSummary, rxCube, and rxCrossTabs can compute a few summary statistics and do aggregations very quickly, they are not sufficiently general to be used in all places.

taxi_group <- taxi_transform %>%
group_by(pickup_nhood) %>%
summarise(ave_tip = mean(tip_pct))
taxi_group %>% head
## pickup_nhood ave_tip
## 1 Arden Heights 0.03018716
## 2 Arrochar 0.12730094
## 3 Arverne 0.10823474
## 4 Astoria 0.13915798
## 5 Astoria Heights 0.09144881
## 6 Auburndale 0.09283292
We could do the same computation with rxCube as well, but it would require additional considerations. First, we would have to make sure the pickup_nhood column was a factor (we can't mutate it in place because of the different data types). Second, rxCube can only provide summations and averages, so we cannot get standard deviations, for instance.

rxFactors(inData = taxi_transform,
outFile = "/datadrive/alizaidi/taxi_factor.xdf",
factorInfo = c("pickup_nhood"),
overwrite = TRUE)
## Warning in factorInfoVarList(factorInfo[i], varInfo, sortLevelsDefault = sortLevels, :
## No changes will be made to the factor variable 'pickup_nhood'
## because 'sortLevels = FALSE' and there is no 'indexMap'.
## Warning in rxFactorsBase(inData = dataIO[["inData"]], factorInfo =
## factorInfo, : No changes made to the data set.
head(rxCube(tip_pct ~ pickup_nhood,
means = TRUE,
data = "/datadrive/alizaidi/taxi_factor.xdf"))
## pickup_nhood tip_pct Counts
## 1 Annadale NaN 0
## 2 Arden Heights 0.03018716 5
## 3 Arrochar 0.12730094 5
## 4 Arverne 0.10823474 16
## 5 Astoria 0.13915798 63662
## 6 Astoria Heights 0.09144881 1512
# file.remove("data/taxi_factor.xdf")
As we saw above, it’s pretty easy to create a summarization or aggregation script, and we can encapsulate our aggregation into its own function. Suppose we wanted to calculate average tip as a function of dropoff and pickup neighborhoods. In the dplyr nomenclature, this means grouping by dropoff and pickup neighborhoods, then summarizing/averaging tip percent.
rxGetInfo(taxi_transform, numRows = 5)
## File name: /datadrive/alizaidi/taxiTransform.xdf
## Number of observations: 13550925
## Number of variables: 25
## Number of blocks: 326
## Compression type: zlib
## Data (5 rows starting with row 1):
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2015-07-01 00:00:00 2015-07-01 00:15:26 1
## 2 2 2015-07-01 00:00:06 2015-07-01 00:04:44 1
## 3 2 2015-07-01 00:00:09 2015-07-01 00:06:27 5
## 4 2 2015-07-01 00:00:12 2015-07-01 00:04:18 2
## 5 2 2015-07-01 00:00:16 2015-07-01 00:27:13 1
## trip_distance pickup_longitude pickup_latitude RatecodeID
## 1 3.50 -73.99416 40.75113 1
## 2 0.77 -73.98556 40.75554 1
## 3 1.12 -73.97540 40.75190 1
## 4 1.02 -74.01014 40.72051 1
## 5 10.43 -73.87294 40.77415 1
## store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type
## 1 N -73.97682 40.78857 2
## 2 N -73.97373 40.75031 2
## 3 N -73.99106 40.75073 1
## 4 N -74.00793 40.73129 1
## 5 N -73.98093 40.76450 1
## fare_amount tip_amount tolls_amount tip_percent pickup_hour pickup_dow
## 1 14.0 0.00 0.00 0.0000000 10PM-1AM Wed
## 2 5.0 0.00 0.00 0.0000000 10PM-1AM Wed
## 3 6.5 2.34 0.00 0.3600000 10PM-1AM Wed
## 4 5.5 1.36 0.00 0.2472727 10PM-1AM Wed
## 5 32.5 11.80 5.54 0.3630769 10PM-1AM Wed
## dropoff_hour dropoff_dow trip_duration pickup_nhood
## 1 10PM-1AM Wed 926 Garment District
## 2 10PM-1AM Wed 278 Garment District
## 3 10PM-1AM Wed 378 Turtle Bay
## 4 10PM-1AM Wed 246 Tribeca
## 5 10PM-1AM Wed 1617 La Guardia Airport
## dropoff_nhood Sample tip_pct
## 1 Upper West Side Keep 0.0000000
## 2 Tudor City Keep 0.0000000
## 3 Garment District Keep 0.3600000
## 4 West Village Keep 0.2472727
## 5 Midtown Keep 0.3630769
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_transform %>%
filter(pickup_nhood %in% mht_hoods,
dropoff_nhood %in% mht_hoods,
.rxArgs = list(transformObjects = list(mht_hoods = manhattan_hoods))) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) -> sum_df
sum_df %>% rxGetInfo(getVarInfo = TRUE, numRows = 5)
## File name: /tmp/RtmptR5PAR/file4e7e2eb0c46a.xdf
## Number of observations: 326
## Number of variables: 4
## Number of blocks: 1
## Compression type: zlib
## Variable information:
## Var 1: dropoff_nhood
## 269 factor levels: Annadale Arden Heights Arrochar Arverne Astoria ... Wingate Woodhaven Woodlawn Woodrow Woodside
## Var 2: pickup_nhood
## 269 factor levels: Annadale Arden Heights Arrochar Arverne Astoria ... Wingate Woodhaven Woodlawn Woodrow Woodside
## Var 3: ave_tip, Type: numeric, Low/High: (0.0714, 0.2558)
## Var 4: ave_dist, Type: numeric, Low/High: (3.0054, 13.9174)
## Data (5 rows starting with row 1):
## dropoff_nhood pickup_nhood ave_tip ave_dist
## 1 Central Park Battery Park 0.1027538 6.046687
## 2 Clinton Battery Park 0.1218306 3.896076
## 3 East Harlem Battery Park 0.1495259 9.396812
## 4 East Village Battery Park 0.1433366 4.054086
## 5 Garment District Battery Park 0.1237102 3.905760
class(sum_df)
## [1] "grouped_tbl_xdf"
## attr(,"package")
## [1] "dplyrXdf"
Alternatively, we can encapsulate this script into a function, so that we can easily call it in a functional pipeline.
taxi_hood_sum <- function(taxi_data = taxi_df, ...) {
taxi_data %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods, ...) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) -> sum_df
return(sum_df)
}
The resulting summary object isn’t very large (326 rows in the run above), so it shouldn’t cause any memory overhead issues if we convert it now to a data.frame. We can plot our results using our favorite plotting library.
tile_plot_hood <- function(df = taxi_hood_sum()) {
library(ggplot2)
ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_tip), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue") -> gplot
return(gplot)
}
# tile_plot_hood(as.data.frame(sum_df))
taxi_transform <- taxi_xdf %>% mutate(tip_pct = tip_amount/fare_amount)
library(plotly)
sum_df <- taxi_hood_sum(taxi_transform,
.rxArgs = list(transformObjects = list(manhattan_hoods = manhattan_hoods))) %>%
persist("/datadrive/alizaidi/summarized.xdf")
ggplotly(tile_plot_hood(as.data.frame(sum_df)))
The do verb is an exception to the rule that dplyrXdf verbs write their output as xdf files. This is because do executes arbitrary R code, and can return arbitrary R objects; while a data frame is capable of storing these objects, an xdf file is limited to character and numeric vectors only.
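As a minimal sketch (reusing the taxi_xdf object from above), do can return any R object per group — for example, a vector of tip-amount quantiles per day of week, stored in a list column:

```r
# Hedged sketch: do brings each pickup_dow group into memory as a
# data frame (referenced as .), so plain R functions like quantile()
# can be applied to each group.
taxi_quants <- taxi_xdf %>%
  group_by(pickup_dow) %>%
  do(quants = quantile(.$tip_amount, probs = c(0.25, 0.5, 0.75)))

# One row per day of week; each list-column entry holds that
# group's quantile vector.
taxi_quants$quants[[1]]
```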
The doXdf verb is similar to do, but where do splits its input into one data frame per group, doXdf splits it into one xdf file per group. This allows do-like functionality with grouped data, where each group can be arbitrarily large. The syntax for the two functions is essentially the same, although the code passed to doXdf must obviously know how to handle xdfs.
taxi_models <- taxi_xdf %>%
  group_by(pickup_dow) %>%
  doXdf(model = rxLinMod(tip_amount ~ fare_amount, data = .))
taxi_models
## Source: local data frame [7 x 2]
## Groups: <by row>
##
## # A tibble: 7 × 2
## pickup_dow model
## * <fctr> <list>
## 1 Fri <S3: rxLinMod>
## 2 Mon <S3: rxLinMod>
## 3 Sat <S3: rxLinMod>
## 4 Sun <S3: rxLinMod>
## 5 Thu <S3: rxLinMod>
## 6 Tue <S3: rxLinMod>
## 7 Wed <S3: rxLinMod>
taxi_models$model[[1]]
## Call:
## rxLinMod(formula = tip_amount ~ fare_amount, data = .)
##
## Linear Regression Results for: tip_amount ~ fare_amount
## Data: . (RxXdfData Data Source)
## File name: /tmp/RtmptR5PAR/tripdata_2015.pickup_dow.Fri.xdf
## Dependent variable(s): tip_amount
## Total independent variables: 2
## Number of valid observations: 2030949
## Number of missing observations: 0
##
## Coefficients:
## tip_amount
## (Intercept) 0.1226685
## fare_amount 0.1260427
All the caveats that go with working with data.frames apply here. While each grouped partition is its own RxXdfData object, the return value must be a data.frame, and hence must fit in memory. Moreover, the function you apply to the splits determines how they are processed. If you use an rx function, you get the fault-tolerant, parallel execution strategies the RevoScaleR package provides; a vanilla/CRAN function, by contrast, operates on in-memory data.frames, and a large group can easily cause your session to crash.
library(broom)
taxi_broom <- taxi_xdf %>%
  group_by(pickup_dow) %>%
  doXdf(model = lm(tip_amount ~ fare_amount, data = .))
Now we can apply the broom::tidy function at the row level to get summary statistics:
library(broom)
tbl_df(taxi_broom) %>% tidy(model)
## Source: local data frame [14 x 6]
## Groups: pickup_dow [7]
##
## pickup_dow term estimate std.error statistic p.value
## <fctr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sun (Intercept) 0.03730041 0.0024937318 14.95767 1.397698e-50
## 2 Sun fare_amount 0.12631622 0.0001514979 833.78190 0.000000e+00
## 3 Mon (Intercept) 0.02654646 0.0023435719 11.32735 9.629607e-30
## 4 Mon fare_amount 0.13543983 0.0001409857 960.66361 0.000000e+00
## 5 Tue (Intercept) 0.06650145 0.0022397638 29.69128 1.103022e-193
## 6 Tue fare_amount 0.13526525 0.0001384011 977.34249 0.000000e+00
## 7 Wed (Intercept) 0.03476736 0.0023254903 14.95055 1.554454e-50
## 8 Wed fare_amount 0.13861577 0.0001423052 974.07407 0.000000e+00
## 9 Thu (Intercept) 0.06643491 0.0024278106 27.36413 7.858925e-165
## 10 Thu fare_amount 0.13565675 0.0001462961 927.27526 0.000000e+00
## 11 Fri (Intercept) 0.12266850 0.0023882664 51.36299 0.000000e+00
## 12 Fri fare_amount 0.12604272 0.0001445679 871.85803 0.000000e+00
## 13 Sat (Intercept) 0.17156504 0.0020854509 82.26760 0.000000e+00
## 14 Sat fare_amount 0.11094274 0.0001353946 819.40287 0.000000e+00